Machine Learning Case Study, Step by Step: Tennessee Eastman Process Simulation Dataset

By working through the Tennessee Eastman Process (TEP) dataset, we will walk through how a machine learning project is built from scratch, developing the ability to define and solve problems. In this case study we will learn:

  1. Data observation, analysis, and visualization
  2. How to build a machine learning model
  3. How to evaluate model performance and visualize the results
  4. How to improve model performance

1. Data Science Process


2. Defining the Problem and Understanding the Data Source


  • ### Tennessee Eastman Process Simulation Dataset

    • Original paper: Downs, James J., and Ernest F. Vogel. "A plant-wide industrial process control problem." Computers & Chemical Engineering 17.3 (1993): 245-255.
    • Simulated dataset source: https://www.kaggle.com/averkij/tennessee-eastman-process-simulation-dataset/activity
    • The Tennessee Eastman Process (TEP) dataset consists of process data generated by a chemical plant simulation platform built by the Eastman Chemical Company, providing an experimental testbed for fault-diagnosis research on this class of processes. The generated data are time-varying, strongly coupled, and nonlinear, so an accurate mathematical model is difficult to obtain and model-driven methods tend to underperform. Representative data-driven approaches are multivariate statistical methods such as principal component analysis, factor analysis, canonical correlation analysis, and cluster analysis.
  • ### The diagram of the process. The overall process consists of five main units: a reactor, a condenser, a recycle compressor, a separator, and a stripper. The main reactions are:

    • A(g) + C(g) + D(g) → G(liq)  Product 1
    • A(g) + C(g) + E(g) → H(liq)  Product 2
    • A(g) + E(g) → F(liq)      Byproduct
    • 3D(g) → 2F(liq)         Byproduct

  • ### Process Faults

| Variable number | Process variable | Type |
| --- | --- | --- |
| IDV(1) | A/C feed ratio, B composition constant (stream 4) | Step |
| IDV(2) | B composition, A/C ratio constant (stream 4) | Step |
| IDV(3) | D feed temperature (stream 2) | Step |
| IDV(4) | Reactor cooling water inlet temperature | Step |
| IDV(5) | Condenser cooling water inlet temperature | Step |
| IDV(6) | A feed loss (stream 1) | Step |
| IDV(7) | C header pressure loss, reduced availability (stream 4) | Step |
| IDV(8) | A, B, C feed composition (stream 4) | Random variation |
| IDV(9) | D feed temperature (stream 2) | Random variation |
| IDV(10) | C feed temperature (stream 4) | Random variation |
| IDV(11) | Reactor cooling water inlet temperature | Random variation |
| IDV(12) | Condenser cooling water inlet temperature | Random variation |
| IDV(13) | Reaction kinetics | Slow drift |
| IDV(14) | Reactor cooling water valve | Sticking |
| IDV(15) | Condenser cooling water valve | Sticking |
| IDV(16) | Unknown | Unknown |
| IDV(17) | Unknown | Unknown |
| IDV(18) | Unknown | Unknown |
| IDV(19) | Unknown | Unknown |
| IDV(20) | Unknown | Unknown |

Problem Definition

  • ### Use the simulated data to predict which operating fault (if any) occurred (a classification problem: Normal / Fault 1 / Fault 2 / ...)

Machine Learning Task

3. Setup Before Writing Code


Loading the required modules and packages


[Usage] - import package

  • import package → import the package `package`
  • import package as p → import `package` and alias it as `p`
  • import package.module1 as m1 → import `module1` from under `package` and alias it as `m1`
  • from package import module1 → import `module1` from `package` (`module1` lives under `package`)

In [1]:
import numpy as np                 # matrix operations
import pandas as pd                # tabular data handling
import matplotlib.pyplot as plt    # visualization
import seaborn as sns              # advanced visualization

np.set_printoptions(edgeitems=25, linewidth=150, formatter=dict(float=lambda x: "%.3g" % x))  # configure array display formatting

Downloading the data to the Colab workspace

In [2]:
# TEP_FaultFree_training_100run.csv
!gdown --id 1ckTubYilJSW9q9cmp89NGpaarqQlt2NF

# TEP_Fault_training_25run.csv
!gdown --id 1ktdevLBIeFUSRwparVpyAEE7cTlKZxDg

# TEP_Fault_training_10run.csv
!gdown --id 1zwiqv6GsnCn3jbrHXZjRJhd9iW16Eeig

# TEP_Fault_testing_10run.csv
!gdown --id 1URWWeZQyO3FhS0U_nPuzMkqEb0EfX03Y
Downloading...
From: https://drive.google.com/uc?id=1ckTubYilJSW9q9cmp89NGpaarqQlt2NF
To: /content/TEP_FaultFree_training_100run.csv
100% 24.4M/24.4M [00:00<00:00, 76.3MB/s]
Downloading...
From: https://drive.google.com/uc?id=1ktdevLBIeFUSRwparVpyAEE7cTlKZxDg
To: /content/TEP_Fault_training_25run.csv
100% 121M/121M [00:01<00:00, 100MB/s] 
Downloading...
From: https://drive.google.com/uc?id=1zwiqv6GsnCn3jbrHXZjRJhd9iW16Eeig
To: /content/TEP_Fault_training_10run.csv
100% 48.4M/48.4M [00:00<00:00, 92.1MB/s]
Downloading...
From: https://drive.google.com/uc?id=1URWWeZQyO3FhS0U_nPuzMkqEb0EfX03Y
To: /content/TEP_Fault_testing_10run.csv
100% 72.2M/72.2M [00:00<00:00, 107MB/s] 

The dataset is simulated with a sampling interval of 3 minutes, i.e. 20 measurements per hour. Each training run lasts 25 hours and each testing run 48 hours. When the fault data are collected, each training run starts in normal operation and the fault is injected from hour 1 onward, so the first 20 rows are normal and the following 480 rows are faulty; each testing run starts in normal operation and the fault is injected from hour 8 onward, so the first 160 rows are normal and the following 800 rows are faulty.

  • Normal data: train 500 rows per run, test 960 rows per run
  • Fault data: train 500 rows per run (first 20 rows normal), test 960 rows per run (first 160 rows normal)

File descriptions:

  • TEP_FaultFree_training_100run.csv: 100 simulated runs of normal-operation training data
  • TEP_Fault_training_25run.csv: 25 simulated runs of faulty-operation training data
  • TEP_Fault_training_10run.csv: 10 simulated runs of faulty-operation training data
  • TEP_Fault_testing_10run.csv: 10 simulated runs of faulty-operation testing data

[Usage] - function arguments

  • function(X, a=1, b=None) → X is a required parameter; a defaults to 1; b defaults to None
    • Omitting a required parameter raises an error; a parameter with a default falls back to that default when omitted
    • The order can be changed as long as values are passed by parameter name, e.g. function(X=array, a=2) or function(a=2, X=array)
    • Without parameter names, values are assigned in the function's declared order, e.g. function(array, 2) → computed with X=array, a=2, b=None
In [3]:
def function(X, a=1, b=None):
  print(f'X={X}, a={a}, b={b}')
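Calling the `function` defined above in the styles just described makes the rules concrete (the calls below are illustrative additions, not part of the original notebook):

```python
def function(X, a=1, b=None):
    # Echo the arguments that were received
    print(f'X={X}, a={a}, b={b}')

function([1, 2], 2)         # positional order: X=[1, 2], a=2, b=None
function(a=2, X=[1, 2])     # keyword arguments may come in any order
function([1, 2], b='flag')  # a keeps its default 1; b is set by name
```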

In [4]:
train_normal_df = pd.read_csv('TEP_FaultFree_training_100run.csv')  # read the CSV file with the given name
train_fault_df = pd.read_csv('TEP_Fault_training_10run.csv')

Inspecting the data

In [5]:
train_normal_df.head(n=5)       # first n rows
Out[5]:
[Output: first 5 rows of train_normal_df, 55 columns (faultNumber, simulationRun, sample, xmeas_1–xmeas_41, xmv_1–xmv_11); faultNumber is 0.0 in every row]
In [6]:
train_normal_df.info()       # DataFrame info: row count, column names, dtypes, non-null counts, memory usage
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 55 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   faultNumber    50000 non-null  float64
 1   simulationRun  50000 non-null  float64
 2   sample         50000 non-null  int64  
 3   xmeas_1        50000 non-null  float64
 4   xmeas_2        50000 non-null  float64
 5   xmeas_3        50000 non-null  float64
 6   xmeas_4        50000 non-null  float64
 7   xmeas_5        50000 non-null  float64
 8   xmeas_6        50000 non-null  float64
 9   xmeas_7        50000 non-null  float64
 10  xmeas_8        50000 non-null  float64
 11  xmeas_9        50000 non-null  float64
 12  xmeas_10       50000 non-null  float64
 13  xmeas_11       50000 non-null  float64
 14  xmeas_12       50000 non-null  float64
 15  xmeas_13       50000 non-null  float64
 16  xmeas_14       50000 non-null  float64
 17  xmeas_15       50000 non-null  float64
 18  xmeas_16       50000 non-null  float64
 19  xmeas_17       50000 non-null  float64
 20  xmeas_18       50000 non-null  float64
 21  xmeas_19       50000 non-null  float64
 22  xmeas_20       50000 non-null  float64
 23  xmeas_21       50000 non-null  float64
 24  xmeas_22       50000 non-null  float64
 25  xmeas_23       50000 non-null  float64
 26  xmeas_24       50000 non-null  float64
 27  xmeas_25       50000 non-null  float64
 28  xmeas_26       50000 non-null  float64
 29  xmeas_27       50000 non-null  float64
 30  xmeas_28       50000 non-null  float64
 31  xmeas_29       50000 non-null  float64
 32  xmeas_30       50000 non-null  float64
 33  xmeas_31       50000 non-null  float64
 34  xmeas_32       50000 non-null  float64
 35  xmeas_33       50000 non-null  float64
 36  xmeas_34       50000 non-null  float64
 37  xmeas_35       50000 non-null  float64
 38  xmeas_36       50000 non-null  float64
 39  xmeas_37       50000 non-null  float64
 40  xmeas_38       50000 non-null  float64
 41  xmeas_39       50000 non-null  float64
 42  xmeas_40       50000 non-null  float64
 43  xmeas_41       50000 non-null  float64
 44  xmv_1          50000 non-null  float64
 45  xmv_2          50000 non-null  float64
 46  xmv_3          50000 non-null  float64
 47  xmv_4          50000 non-null  float64
 48  xmv_5          50000 non-null  float64
 49  xmv_6          50000 non-null  float64
 50  xmv_7          50000 non-null  float64
 51  xmv_8          50000 non-null  float64
 52  xmv_9          50000 non-null  float64
 53  xmv_10         50000 non-null  float64
 54  xmv_11         50000 non-null  float64
dtypes: float64(54), int64(1)
memory usage: 21.0 MB
In [7]:
train_fault_df.head(n=5)
Out[7]:
[Output: first 5 rows of train_fault_df, same 55 columns; faultNumber is 1 for these rows]

Column descriptions

  • faultNumber: 0 (normal samples), 1–20 (fault types)
  • simulationRun: run index
  • sample: sample index within the run
  • xmeas_1–xmeas_41: process measurement variables

    • Continuous process measurements
| Description | Variable number | Base case value | Units |
| --- | --- | --- | --- |
| A feed (stream 1) | xmeas_1 | 0.25052 | kscmh |
| D feed (stream 2) | xmeas_2 | 3664.0 | kg h⁻¹ |
| E feed (stream 3) | xmeas_3 | 4509.3 | kg h⁻¹ |
| A and C feed (stream 4) | xmeas_4 | 9.3477 | kscmh |
| Recycle flow (stream 8) | xmeas_5 | 26.902 | kscmh |
| Reactor feed rate (stream 6) | xmeas_6 | 42.339 | kscmh |
| Reactor pressure | xmeas_7 | 2705.0 | kPa gauge |
| Reactor level | xmeas_8 | 75 | % |
| Reactor temperature | xmeas_9 | 120.4 | °C |
| Purge rate (stream 9) | xmeas_10 | 0.33712 | kscmh |
| Product separator temperature | xmeas_11 | 80.109 | °C |
| Product separator level | xmeas_12 | 50 | % |
| Product separator pressure | xmeas_13 | 2633.7 | kPa gauge |
| Product separator underflow (stream 10) | xmeas_14 | 25.16 | m³ h⁻¹ |
| Stripper level | xmeas_15 | 50 | % |
| Stripper pressure | xmeas_16 | 3102.2 | kPa gauge |
| Stripper underflow (stream 11) | xmeas_17 | 22.949 | m³ h⁻¹ |
| Stripper temperature | xmeas_18 | 65.731 | °C |
| Stripper steam flow | xmeas_19 | 230.31 | kg h⁻¹ |
| Compressor work | xmeas_20 | 341.43 | kW |
| Reactor cooling water outlet temperature | xmeas_21 | 94.599 | °C |
| Separator cooling water outlet temperature | xmeas_22 | 77.297 | °C |
    • Sampled process measurements

Reactor feed analysis (stream 6), sampling frequency = 0.1 h:

| Component | Variable number | Base case value | Units |
| --- | --- | --- | --- |
| A | xmeas_23 | 32.188 | mol% |
| B | xmeas_24 | 8.8933 | mol% |
| C | xmeas_25 | 26.383 | mol% |
| D | xmeas_26 | 6.8820 | mol% |
| E | xmeas_27 | 18.776 | mol% |
| F | xmeas_28 | 1.6567 | mol% |

Purge gas analysis (stream 9), sampling frequency = 0.1 h:

| Component | Variable number | Base case value | Units |
| --- | --- | --- | --- |
| A | xmeas_29 | 32.958 | mol% |
| B | xmeas_30 | 13.823 | mol% |
| C | xmeas_31 | 23.978 | mol% |
| D | xmeas_32 | 1.2565 | mol% |
| E | xmeas_33 | 18.579 | mol% |
| F | xmeas_34 | 2.2633 | mol% |
| G | xmeas_35 | 4.8436 | mol% |
| H | xmeas_36 | 2.2986 | mol% |

Product analysis (stream 11), sampling frequency = 0.25 h:

| Component | Variable number | Base case value | Units |
| --- | --- | --- | --- |
| D | xmeas_37 | 0.01787 | mol% |
| E | xmeas_38 | 0.83570 | mol% |
| F | xmeas_39 | 0.09858 | mol% |
| G | xmeas_40 | 53.724 | mol% |
| H | xmeas_41 | 43.828 | mol% |
  • xmv_1–xmv_11: process manipulated variables

| Description | Variable number | Base case value (%) | Low limit | High limit | Units |
| --- | --- | --- | --- | --- | --- |
| D feed flow (stream 2) | xmv_1 | 63.053 | 0 | 5811 | kg h⁻¹ |
| E feed flow (stream 3) | xmv_2 | 53.980 | 0 | 8354 | kg h⁻¹ |
| A feed flow (stream 1) | xmv_3 | 24.644 | 0 | 1.017 | kscmh |
| A and C feed flow (stream 4) | xmv_4 | 61.302 | 0 | 15.25 | kscmh |
| Compressor recycle valve | xmv_5 | 22.210 | 0 | 100 | % |
| Purge valve (stream 9) | xmv_6 | 40.064 | 0 | 100 | % |
| Separator pot liquid flow (stream 10) | xmv_7 | 38.100 | 0 | 65.71 | m³ h⁻¹ |
| Stripper liquid product flow (stream 11) | xmv_8 | 46.534 | 0 | 49.10 | m³ h⁻¹ |
| Stripper steam valve | xmv_9 | 47.446 | 0 | 100 | % |
| Reactor cooling water flow | xmv_10 | 41.106 | 0 | 227.1 | m³ h⁻¹ |
| Condenser cooling water flow | xmv_11 | 18.114 | 0 | 272.6 | m³ h⁻¹ |

For this dataset, X (features) and y (target) are:

  • X: xmeas_1, xmeas_2, xmeas_3, ..., xmv_1, xmv_2, ...
  • y: faultNumber
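A minimal sketch of separating X from y; the toy DataFrame below is a stand-in for the loaded data, with the same column roles:

```python
import pandas as pd

# Toy stand-in with the same column roles as train_fault_df
df = pd.DataFrame({
    'faultNumber': [0, 1],
    'simulationRun': [1.0, 1.0],
    'sample': [1, 2],
    'xmeas_1': [0.25, 0.26],
    'xmv_1': [63.0, 63.1],
})

# Features are the xmeas_* and xmv_* columns; the target is faultNumber.
# simulationRun and sample are bookkeeping columns, not features.
feature_cols = [c for c in df.columns if c.startswith(('xmeas_', 'xmv_'))]
X = df[feature_cols]
y = df['faultNumber']
```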
In [8]:
train_normal_df.describe()  # summary statistics for numeric columns: count, mean, std, min, 25th/50th/75th percentiles, max
Out[8]:
[Output: describe() table for all 55 columns of train_normal_df (50,000 rows); faultNumber is constant at 0, and the value ranges of the measurement columns differ by several orders of magnitude]

TEP_FaultFree_training_100run.csv contains only faultNumber = 0 (Normal), and the value ranges of the monitored variables differ greatly. Apart from the label itself, no column holds only a single value.

In [9]:
train_fault_df.describe()  # summary statistics for numeric columns: count, mean, std, min, 25th/50th/75th percentiles, max
Out[9]:
[Output: describe() table for all 55 columns of train_fault_df (100,000 rows); faultNumber ranges from 1 to 20]

In TEP_Fault_training_10run.csv, faultNumber takes the values 1–20 (the process faults).

5. Initial Data Exploration and Preprocessing


Data visualization

Take a small subset of the data to compare normal data vs. fault data.

[Usage] - slicing

  • array[start:end:step] → takes the elements from start (default 0) up to but not including end (default: the end of the array), with step size step (default 1)
In [10]:
a = list(range(0, 15))
print('a:\t ', a)
print('a[:]\t ', a[:])
print('a[:2]:\t ', a[:2])
print('a[2:6]:\t ', a[2:6])
print('a[1:]:\t ', a[1:])
print('a[2:8:2]:', a[2:8:2])
a:	  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
a[:]	  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
a[:2]:	  [0, 1]
a[2:6]:	  [2, 3, 4, 5]
a[1:]:	  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
a[2:8:2]: [2, 4, 6]

[Usage] - pd.DataFrame

  • Select a column by name → df[column_name] or df.column_name
  • Select rows by index → df.loc[row_name, :] or df.iloc[row_num, :]
  • df.loc[row_name, column_name] (loc: location, takes labels) vs. df.iloc[row_num, column_num] (iloc: integer-location, takes integer positions)
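A small illustration of the `loc` vs. `iloc` distinction (a toy DataFrame, not the TEP data):

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30]}, index=['a', 'b', 'c'])

print(df.loc['b', 'x'])       # by label -> 20
print(df.iloc[1, 0])          # by integer position -> also 20
print(df['x'].equals(df.x))   # two equivalent column spellings -> True
```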

[Usage] - conditional statements

The result is a boolean: True / False

  • a == b → whether a equals b
  • a != b → whether a does not equal b
  • a in b → whether a is an element of b
  • a and b → both a and b hold
  • a or b → a or b holds
  • ...

Combined with array indexing → keeps only the entries where the condition is True

In [11]:
a = np.arange(10)
print('a:\t', a)
print('a>5:\t', a>5)
print('a[a>5]:\t', a[a>5])
a:	 [0 1 2 3 4 5 6 7 8 9]
a>5:	 [False False False False False False  True  True  True  True]
a[a>5]:	 [6 7 8 9]
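One caveat worth adding (my note, not in the original notebook): on NumPy arrays, element-wise conditions are combined with `&` and `|` rather than the Python keywords `and`/`or`:

```python
import numpy as np

a = np.arange(10)
# Parentheses are required because & binds more tightly than the comparisons;
# `a > 2 and a < 7` would raise an error on arrays.
mask = (a > 2) & (a < 7)
print(a[mask])  # [3 4 5 6]
```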

In [12]:
sample_train_normal = train_normal_df[train_normal_df.simulationRun==1]  # take a subset by requiring simulationRun == 1
sample_train_fault = train_fault_df[train_fault_df.simulationRun==1]
  • faultNumber = 1 and 2

[Usage] - for loop

In [13]:
for idx, i in enumerate(range(6, 12)):
  print(f'{idx+1} th loop: {i}')
1 th loop: 6
2 th loop: 7
3 th loop: 8
4 th loop: 9
5 th loop: 10
6 th loop: 11

In [14]:
plt.figure(figsize=(15, 20))  # set the figure size

for idx, i in enumerate(range(0, 5)): # inspect the distribution of the i-th feature
  plt.subplot(5, 1, idx+1)  # select position idx+1 on a 5-row, 1-column grid
  plt.yscale('log')  # use a logarithmic y axis

  plt.plot(sample_train_normal['sample'],
           sample_train_normal.iloc[:, i+3],
           label='Normal')  # line plot

  for j in range(2):  # inspect the j-th process fault
    plt.plot(sample_train_fault.loc[sample_train_fault.faultNumber==j+1, 'sample'],
             sample_train_fault.loc[sample_train_fault.faultNumber==j+1, f'xmeas_{i+1}'],
             label=f'Fault_{j+1}')  # line plot
    
  plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))  # add a legend
  plt.axvline(x=20, color='r', linestyle='--')  # vertical line at the fault-injection point
  plt.title(sample_train_normal.columns[i+3])  # set the title
  plt.tight_layout()  # keep spacing between subplots
In [15]:
plt.figure(figsize=(15, 20))  # set the figure size

for idx, i in enumerate(range(0, 5)):  # inspect the distribution of the i-th feature
  plt.subplot(5, 1, idx+1)  # select position idx+1 on a 5-row, 1-column grid
  plt.yscale('log')  # use a logarithmic y axis

  plt.plot(sample_train_normal['sample'],
           sample_train_normal.iloc[:, i+3],
           label='Normal')  # line plot

  for j in range(2, 4):  # inspect faults 3 and 4
    plt.plot(sample_train_fault.loc[sample_train_fault.faultNumber==(j+1), 'sample'],
             sample_train_fault.loc[sample_train_fault.faultNumber==(j+1), f'xmeas_{i+1}'],
             label=f'Fault_{j+1}')  # line plot
  plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))  # add a legend
  plt.axvline(x=20, color='r', linestyle='--')  # vertical line at the fault-injection point
  plt.title(sample_train_normal.columns[i+3])  # set the title
  plt.tight_layout()  # keep spacing between subplots

First observations on this small sample: for faultNumber = 3 and 4, the distributions of the first five features are very close to Normal. It is worth examining the remaining features for differences and considering whether these classes will be hard to separate.

In [16]:
comb_df = pd.concat((sample_train_normal.iloc[20:, :], sample_train_fault[sample_train_fault['sample']>=20]))  # combine normal and fault data
comb_df['faultNumber'] = comb_df['faultNumber'].astype('int')  # convert the faultNumber column to int (integer) type
  • Boxplot
In [17]:
plt.figure(figsize=(20, 12))  # set the figure size

for idx, i in enumerate(range(3)):  # inspect the distribution of the i-th feature
  plt.subplot(3, 1, idx+1)  # select position idx+1 on a 3-row, 1-column grid
  sns.boxplot(data=comb_df[['faultNumber', f'xmeas_{i+1}']],
              x='faultNumber',
              y=f'xmeas_{i+1}')  # draw a box plot
  plt.tight_layout()  # keep spacing between subplots
  • Violin plot
In [18]:
plt.figure(figsize=(25, 12))  # set the figure size
for idx, i in enumerate(range(3)):  # inspect the distribution of the i-th feature
  plt.subplot(3, 1, idx+1)  # select position idx+1 on a 3-row, 1-column grid
  sns.violinplot(data=comb_df[['faultNumber', f'xmeas_{i+1}']],
                 x='faultNumber',
                 y=f'xmeas_{i+1}')  # draw a violin plot
  plt.tight_layout()  # keep spacing between subplots
  • Correlation heatmap
In [19]:
corr = comb_df.iloc[:,[0]+[i for i in range(3, 55)]].corr(method='kendall')  # correlations between faultNumber and the xmeas/xmv columns
plt.figure(figsize=(30, 20))  # set the figure size
sns.heatmap(corr)  # draw a heatmap
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f15f6eaa450>
In [20]:
comb_df['faultNumber'] = comb_df['faultNumber'].astype('str')  # convert the faultNumber column to str (string) type
new_comb_df = pd.get_dummies(comb_df, columns=['faultNumber'])  # expand faultNumber into 21 indicator columns (one-hot encoding: 1 if the row belongs to that class, else 0)
In [21]:
new_comb_df.head()
Out[21]:
[Output: first 5 rows of new_comb_df; the 54 original columns (simulationRun, sample, xmeas_1–xmeas_41, xmv_1–xmv_11) followed by 21 one-hot columns faultNumber_0–faultNumber_20]
In [22]:
corr = new_comb_df.iloc[:,2:].corr(method='kendall')  # Kendall correlations among the faultNumber indicators and the xmeas/xmv variables
plt.figure(figsize=(30, 20))  # set the figure size
sns.heatmap(corr)  # draw the heatmap
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f15f8e261d0>

6. Structuring the data into a model-ready format

(Return to outline...)

  • All features must be numeric → models can only operate on numbers
  • Features must not include the target column, or anything derived from it → prevents the model from peeking at the answer
  • Consider the real workflow: when a new record arrives without a target value, are the features you can collect consistent with the training set? → test data must go through the same preprocessing as the training data
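The checklist above can be sketched with a toy example. The frame and column names below are hypothetical (not from the TEP data); `pd.get_dummies` handles the numeric conversion, and dropping the target column before encoding keeps the answer out of the feature matrix:

```python
import pandas as pd

# Hypothetical toy frame: one categorical feature, one numeric feature, and the target.
df = pd.DataFrame({
    "valve_state": ["open", "closed", "open"],  # text -> must be converted to numbers
    "temperature": [120.4, 119.9, 120.1],       # already numeric
    "faultNumber": [0, 4, 0],                   # target -> must stay out of X
})

X = pd.get_dummies(df.drop(columns="faultNumber"))  # one-hot encode everything except the target
y = df["faultNumber"]
print(list(X.columns))  # temperature plus one dummy column per valve_state value
```

The same `get_dummies` call (or a fitted encoder) must also be applied to new, unlabeled records, so that the features at serving time stay consistent with the training set.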

7. Splitting the data into a training set and a validation set

(Return to outline...)

In [23]:
train_all_10run = pd.concat((train_normal_df[train_normal_df['simulationRun']<=10], train_fault_df[train_fault_df['sample']>20]))  # combine the normal data (first 10 runs) with the fault data (samples after index 20)
In [24]:
train_all_10run.describe()
Out[24]:
faultNumber simulationRun sample xmeas_1 xmeas_2 xmeas_3 xmeas_4 xmeas_5 xmeas_6 xmeas_7 xmeas_8 xmeas_9 xmeas_10 xmeas_11 xmeas_12 xmeas_13 xmeas_14 xmeas_15 xmeas_16 xmeas_17 xmeas_18 xmeas_19 xmeas_20 xmeas_21 xmeas_22 xmeas_23 xmeas_24 xmeas_25 xmeas_26 xmeas_27 xmeas_28 xmeas_29 xmeas_30 xmeas_31 xmeas_32 xmeas_33 xmeas_34 xmeas_35 xmeas_36 xmeas_37 xmeas_38 xmeas_39 xmeas_40 xmeas_41 xmv_1 xmv_2 xmv_3 xmv_4 xmv_5 xmv_6 xmv_7 xmv_8 xmv_9 xmv_10 xmv_11
count 101000.000000 101000.000000 101000.00000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000
mean 9.980198 5.500000 260.00495 0.262804 3663.178179 4503.214891 9.360069 26.899056 42.357745 2720.752178 74.908412 120.400096 0.345663 79.790915 49.990765 2648.574014 25.110328 49.949112 3119.447944 22.940343 65.980934 244.679638 340.053394 94.456302 77.012846 31.934102 8.884243 26.768599 6.876507 18.742404 1.629313 32.569461 13.805908 24.569678 1.256861 18.515615 2.223707 4.788594 2.267739 0.018188 0.842274 0.097784 53.750068 43.775112 63.319831 54.059600 30.527387 63.048547 22.055053 40.078360 38.072819 46.416243 50.865050 42.135156 18.642896
std 6.065645 2.872296 138.87286 0.147955 41.819392 107.717276 0.335644 0.231183 0.305365 72.083321 1.290699 0.072700 0.085998 1.726930 1.007209 72.298031 1.067460 1.034812 76.295683 0.642089 1.874903 67.925725 10.657147 1.265589 1.458236 1.723630 0.211569 1.955593 0.130426 0.863641 0.128246 2.586441 0.277607 3.016748 0.138770 1.206004 0.175228 0.336655 0.182061 0.010322 0.087569 0.013491 0.572373 0.606581 2.366096 3.841146 20.278465 6.898345 7.983770 12.525716 2.964073 2.394845 17.033037 9.568130 4.078899
min 0.000000 1.000000 1.00000 -0.003999 3383.500000 3738.200000 7.195900 25.434000 40.444000 2449.300000 63.913000 119.650000 0.055312 68.297000 45.591000 2359.000000 19.844000 45.541000 2893.600000 19.725000 55.305000 -2.368600 255.130000 80.309000 63.684000 23.663000 7.443800 18.629000 6.101100 12.637000 0.932660 20.229000 12.262000 12.324000 0.415950 10.420000 1.297500 3.224700 1.418500 -0.019730 0.398470 0.027588 51.209000 41.022000 58.416000 35.891000 -0.056625 46.231000 -0.040454 0.000000 25.125000 36.215000 -0.268380 -0.210000 -0.002106
25% 5.000000 3.000000 140.00000 0.218450 3636.100000 4467.500000 9.259600 26.759000 42.178000 2697.300000 74.439000 120.380000 0.322470 79.776000 49.310000 2624.800000 24.388000 49.266000 3096.000000 22.512000 65.356000 222.210000 338.950000 94.431000 76.965750 31.884000 8.795300 26.128000 6.797600 18.502000 1.627900 32.558000 13.707000 23.658000 1.181800 18.242000 2.231700 4.768200 2.242500 0.010688 0.816790 0.090034 53.368000 43.399000 62.616000 53.478000 22.318000 60.317000 21.355000 38.268000 36.068000 44.835000 45.322000 40.580000 17.142000
50% 10.000000 5.500000 260.00000 0.251900 3662.500000 4508.000000 9.354700 26.897000 42.344000 2706.300000 74.963000 120.400000 0.336090 80.077000 50.016000 2634.600000 25.126000 49.904000 3103.200000 22.936000 65.847000 233.750000 341.180000 94.595000 77.259000 32.175000 8.887000 26.434000 6.881100 18.777000 1.653700 32.921000 13.815000 24.039000 1.260200 18.568000 2.260800 4.840700 2.296300 0.018454 0.836905 0.098408 53.745000 43.795500 63.078000 53.973000 25.293000 61.482000 22.120000 39.950000 38.146000 46.312000 48.272000 41.200000 18.292000
75% 15.000000 8.000000 380.00000 0.283750 3690.000000 4547.300000 9.456000 27.040000 42.521000 2715.700000 75.456000 120.420000 0.348560 80.365000 50.663000 2644.600000 25.814000 50.652000 3111.400000 23.375000 66.522000 249.440000 343.080000 94.759000 77.536000 32.445000 8.977000 26.758000 6.960100 19.062000 1.677800 33.255000 13.914000 24.479000 1.339600 18.894000 2.287400 4.908500 2.346100 0.025559 0.858940 0.106460 54.112000 44.187000 63.552000 54.451000 28.895000 62.803000 22.758000 41.500000 40.050000 48.043000 52.190000 41.954000 19.516250
max 20.000000 10.000000 500.00000 1.012100 3883.300000 5111.800000 11.777000 28.212000 44.209000 3000.200000 85.122000 120.920000 0.803380 87.051000 54.204000 2945.400000 31.263000 54.498000 3448.700000 26.654000 74.359000 463.460000 392.400000 98.623000 82.555000 39.481000 9.992500 36.255000 7.710200 25.294000 2.140300 43.956000 15.380000 38.829000 2.210200 27.568000 2.966100 6.300600 2.973800 0.057655 1.526600 0.173690 56.024000 46.442000 100.000000 100.000000 100.100000 100.010000 100.050000 95.729000 50.470000 56.943000 100.380000 100.160000 87.332000
In [25]:
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(train_all_10run.iloc[:, 3:].values,  # split the dataset by the given ratio
                              train_all_10run['faultNumber'].values,
                              test_size=0.25,  # split ratio
                              random_state=17)  # random seed
                              #, stratify=train_all_10run['faultNumber'].values)  # preserve the class ratio of each faultNumber
In [26]:
print(f'The shape of X_train: {X_train.shape}\t y_train: {y_train.shape}')
print(f'The shape of X_valid: {X_valid.shape}\t y_valid: {y_valid.shape}')
The shape of X_train: (75750, 52)	 y_train: (75750,)
The shape of X_valid: (25250, 52)	 y_valid: (25250,)
In [27]:
unique, counts = np.unique(train_all_10run['faultNumber'].values, return_counts=True)  # count how many records each fault type has
plt.bar(unique, counts)  # draw a bar chart
Out[27]:
<BarContainer object of 21 artists>
In [28]:
unique, counts = np.unique(y_train, return_counts=True)  # count how many records each fault type has
plt.bar(unique, counts)  # draw a bar chart
Out[28]:
<BarContainer object of 21 artists>

8. Training a model: start with the simplest classifier

(Return to outline...)


【How to use】- Sklearn model

  • Create a model → model = MODEL()
  • Train the model → model.fit(X, y)
  • Make predictions → model.predict(X)

In [29]:
from sklearn.linear_model import LogisticRegression
In [30]:
model = LogisticRegression()  # create the model
model.fit(X_train, y_train)  # train the model
/usr/local/lib/python3.7/dist-packages/sklearn/linear_model/_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
Out[30]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

9. Evaluating the model we just built

(Return to outline...)

In [31]:
from sklearn.metrics import mean_squared_error, classification_report, confusion_matrix
  • Results on the training set
In [32]:
# model.score(X_train, y_train)

pred_train = model.predict(X_train)  # make predictions
print(classification_report(y_train, pred_train))  # evaluate the results
/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
              precision    recall  f1-score   support

         0.0       0.04      0.00      0.01      3742
         1.0       0.72      0.93      0.81      3560
         2.0       0.73      0.90      0.81      3623
         3.0       0.00      0.00      0.00      3574
         4.0       0.14      0.20      0.17      3552
         5.0       0.10      0.36      0.16      3598
         6.0       0.90      0.87      0.88      3603
         7.0       0.93      0.73      0.81      3603
         8.0       0.25      0.11      0.15      3659
         9.0       0.00      0.00      0.00      3574
        10.0       0.00      0.00      0.00      3532
        11.0       0.00      0.00      0.00      3633
        12.0       0.06      0.26      0.10      3663
        13.0       0.06      0.12      0.08      3641
        14.0       0.00      0.00      0.00      3607
        15.0       0.00      0.00      0.00      3577
        16.0       0.00      0.00      0.00      3611
        17.0       0.17      0.30      0.22      3564
        18.0       0.36      0.35      0.36      3566
        19.0       0.00      0.00      0.00      3670
        20.0       0.26      0.57      0.36      3598

    accuracy                           0.27     75750
   macro avg       0.23      0.27      0.23     75750
weighted avg       0.22      0.27      0.23     75750

  • Results on the validation set
In [33]:
# model.score(X_valid, y_valid)

pred_valid = model.predict(X_valid)  # make predictions
print(classification_report(y_valid, pred_valid))  # evaluate the results
              precision    recall  f1-score   support

         0.0       0.04      0.00      0.01      1258
         1.0       0.76      0.95      0.84      1240
         2.0       0.75      0.89      0.81      1177
         3.0       0.00      0.00      0.00      1226
         4.0       0.16      0.21      0.18      1248
         5.0       0.11      0.40      0.17      1202
         6.0       0.89      0.87      0.88      1197
         7.0       0.92      0.72      0.81      1197
         8.0       0.23      0.11      0.15      1141
         9.0       0.00      0.00      0.00      1226
        10.0       0.00      0.00      0.00      1268
        11.0       0.00      0.00      0.00      1167
        12.0       0.06      0.26      0.09      1137
        13.0       0.06      0.13      0.08      1159
        14.0       0.00      0.00      0.00      1193
        15.0       0.00      0.00      0.00      1223
        16.0       0.00      0.00      0.00      1189
        17.0       0.17      0.28      0.21      1236
        18.0       0.39      0.36      0.38      1234
        19.0       0.00      0.00      0.00      1130
        20.0       0.28      0.58      0.38      1202

    accuracy                           0.27     25250
   macro avg       0.23      0.27      0.24     25250
weighted avg       0.23      0.27      0.24     25250

/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
In [34]:
print(confusion_matrix(y_valid, pred_valid))  # print the confusion matrix
[[   6    0    0    0  144  228    0    4   35    0    0    0  398  169    0    0    0  159    9    0  106]
 [   0 1172    0    0    3   19   36    0    0    0    0    0    5    1    0    0    0    2    0    0    2]
 [   0    0 1050    0   10   47    0    1    0    0    0    0   42    5    0    0    0   15    2    0    5]
 [   3    0    0    0  113  257    0    5   35    0    0    0  341  183    0    0    0  163   11    0  115]
 [   0    0    0    0  266  189    0    6   24    0    0    0  334  122    0    0    0  197   12    0   98]
 [   0    0    0    0   77  480    1    5   29    0    0    0  287  139    0    0    0   78   10    0   96]
 [   0  140    0    0    1    4 1039    0    0    0    0    0    3    0    0    0    0    2    0    0    8]
 [   0   15   30    0    0  112    0  866    0    0    0    0   39   39    0    0    0    3   25    0   68]
 [   1  110   71    0   17  295   51    1  121    0    0    0  145  128    0    0    0   43   87    0   71]
 [   2    0    0    0  138  238    0    3   24    0    0    0  376  185    0    0    0  146    7    0  107]
 [   4    1    0    0   85  378    0    6   35    0    0    0  343  228    0    0    0   97   12    0   79]
 [  43    0    0    0  114  218    0    6   31    0    0    0  350  155    0    0    0  149    9    0   92]
 [   2   18   20    0   17  223    1    5   20    0    0    0  291  153    0    0    0   20  167    0  200]
 [   3   63  183    0   17  150   29    1   10    0    0    0  106  145    0    0    0   23  293    0  136]
 [  65    0    0    0  182  191    0    6   22    0    0    0  342  146    0    0    0  135   10    0   94]
 [   3    0    0    0  108  251    0    2   33    0    0    0  357  201    0    0    0  153    8    0  107]
 [   3    0    0    0   85  310    0    6   33    0    0    0  361  158    0    0    0  121   16    0   96]
 [   3    0    0    0  112  192    0    8   26    0    0    0  335  126    0    0    0  347    6    0   81]
 [   0   31   47    1   31  215   11    0    5    0    0    0  154  109    0    0    0   37  447    0  146]
 [   8    0    0    0  113  200    0    8   35    0    0    0  347  166    0    0    0  132    9    0  112]
 [   0    1    0    0   50  112    0    7    7    0    0    0  204   70    0    0    0   48    3    0  700]]
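Raw counts like the matrix above are hard to compare when class supports differ; dividing each row by its row sum turns the diagonal into per-class recall. A minimal sketch (the 3-class counts below are made up for illustration):

```python
import numpy as np

def normalize_rows(cm):
    """Divide each row of a confusion matrix by its row sum,
    so the diagonal shows per-class recall."""
    cm = np.asarray(cm, dtype=float)
    return cm / cm.sum(axis=1, keepdims=True)

# Hypothetical 3-class confusion matrix
cm = [[50,  5,  5],
      [10, 80, 10],
      [ 0, 20, 80]]
norm = normalize_rows(cm)
print(np.round(norm, 3))  # each row now sums to 1
```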

10. How to improve the results

(Return to outline...)

Check the training set first, then the validation set!

  • Poor results on the training set (underfitting)
    • Data processing (feature scaling, feature selection, feature engineering)
    • Increase model complexity (choose another model, tune the hyperparameters)
  • Good results on the training set but poor results on the validation set (overfitting)
    • Collect more data
    • Remove noisy data
    • Reduce model complexity
    • Add randomness
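The diagnosis above boils down to comparing training and validation scores. A small sketch on synthetic data (`make_classification` here is a stand-in, not the TEP set): an unconstrained decision tree memorizes the training set, while capping `max_depth` lowers complexity and narrows the train/validation gap:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           random_state=17)  # synthetic stand-in data
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=17)

# Unconstrained tree: memorizes the training set (overfitting).
deep = DecisionTreeClassifier(random_state=17).fit(X_tr, y_tr)
# Capped depth: lower complexity, smaller train/validation gap.
shallow = DecisionTreeClassifier(max_depth=3, random_state=17).fit(X_tr, y_tr)

print(f"deep    train={deep.score(X_tr, y_tr):.2f} valid={deep.score(X_va, y_va):.2f}")
print(f"shallow train={shallow.score(X_tr, y_tr):.2f} valid={shallow.score(X_va, y_va):.2f}")
```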

Data preprocessing - Feature scaling

In [35]:
from sklearn.preprocessing import StandardScaler  # (X-mean(X))/std(X)
from sklearn.preprocessing import MinMaxScaler   # (X-min(X))/(max(X)-min(X))
In [36]:
def data_preprocessing(df_input, train=True, sc=None):
    if train:  # for training data, fit the feature scaling on the input data
        sc = StandardScaler()
#         sc = MinMaxScaler()
        df = sc.fit_transform(df_input)

    else:  # for testing data, reuse the scale fitted on the training data
        df = sc.transform(df_input)
    return df, sc
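The point of the `train` flag in `data_preprocessing` is that the scaler's statistics come from the training set only and are then reused unchanged on validation/test data. A quick check with random stand-in arrays (the shapes and values below are arbitrary):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(17)
X_tr = rng.normal(loc=50.0, scale=5.0, size=(1000, 3))  # stand-in training features
X_va = rng.normal(loc=50.0, scale=5.0, size=(200, 3))   # stand-in validation features

sc = StandardScaler()
X_tr_s = sc.fit_transform(X_tr)  # fit the mean/std on the training set only
X_va_s = sc.transform(X_va)      # reuse those statistics; never re-fit here

print(np.round(X_tr_s.mean(axis=0), 6))  # ~0 by construction
print(np.round(X_va_s.mean(axis=0), 6))  # close to 0, but not exactly
```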
In [37]:
train_all_10run.describe()
Out[37]:
faultNumber simulationRun sample xmeas_1 xmeas_2 xmeas_3 xmeas_4 xmeas_5 xmeas_6 xmeas_7 xmeas_8 xmeas_9 xmeas_10 xmeas_11 xmeas_12 xmeas_13 xmeas_14 xmeas_15 xmeas_16 xmeas_17 xmeas_18 xmeas_19 xmeas_20 xmeas_21 xmeas_22 xmeas_23 xmeas_24 xmeas_25 xmeas_26 xmeas_27 xmeas_28 xmeas_29 xmeas_30 xmeas_31 xmeas_32 xmeas_33 xmeas_34 xmeas_35 xmeas_36 xmeas_37 xmeas_38 xmeas_39 xmeas_40 xmeas_41 xmv_1 xmv_2 xmv_3 xmv_4 xmv_5 xmv_6 xmv_7 xmv_8 xmv_9 xmv_10 xmv_11
count 101000.000000 101000.000000 101000.00000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000 101000.000000
mean 9.980198 5.500000 260.00495 0.262804 3663.178179 4503.214891 9.360069 26.899056 42.357745 2720.752178 74.908412 120.400096 0.345663 79.790915 49.990765 2648.574014 25.110328 49.949112 3119.447944 22.940343 65.980934 244.679638 340.053394 94.456302 77.012846 31.934102 8.884243 26.768599 6.876507 18.742404 1.629313 32.569461 13.805908 24.569678 1.256861 18.515615 2.223707 4.788594 2.267739 0.018188 0.842274 0.097784 53.750068 43.775112 63.319831 54.059600 30.527387 63.048547 22.055053 40.078360 38.072819 46.416243 50.865050 42.135156 18.642896
std 6.065645 2.872296 138.87286 0.147955 41.819392 107.717276 0.335644 0.231183 0.305365 72.083321 1.290699 0.072700 0.085998 1.726930 1.007209 72.298031 1.067460 1.034812 76.295683 0.642089 1.874903 67.925725 10.657147 1.265589 1.458236 1.723630 0.211569 1.955593 0.130426 0.863641 0.128246 2.586441 0.277607 3.016748 0.138770 1.206004 0.175228 0.336655 0.182061 0.010322 0.087569 0.013491 0.572373 0.606581 2.366096 3.841146 20.278465 6.898345 7.983770 12.525716 2.964073 2.394845 17.033037 9.568130 4.078899
min 0.000000 1.000000 1.00000 -0.003999 3383.500000 3738.200000 7.195900 25.434000 40.444000 2449.300000 63.913000 119.650000 0.055312 68.297000 45.591000 2359.000000 19.844000 45.541000 2893.600000 19.725000 55.305000 -2.368600 255.130000 80.309000 63.684000 23.663000 7.443800 18.629000 6.101100 12.637000 0.932660 20.229000 12.262000 12.324000 0.415950 10.420000 1.297500 3.224700 1.418500 -0.019730 0.398470 0.027588 51.209000 41.022000 58.416000 35.891000 -0.056625 46.231000 -0.040454 0.000000 25.125000 36.215000 -0.268380 -0.210000 -0.002106
25% 5.000000 3.000000 140.00000 0.218450 3636.100000 4467.500000 9.259600 26.759000 42.178000 2697.300000 74.439000 120.380000 0.322470 79.776000 49.310000 2624.800000 24.388000 49.266000 3096.000000 22.512000 65.356000 222.210000 338.950000 94.431000 76.965750 31.884000 8.795300 26.128000 6.797600 18.502000 1.627900 32.558000 13.707000 23.658000 1.181800 18.242000 2.231700 4.768200 2.242500 0.010688 0.816790 0.090034 53.368000 43.399000 62.616000 53.478000 22.318000 60.317000 21.355000 38.268000 36.068000 44.835000 45.322000 40.580000 17.142000
50% 10.000000 5.500000 260.00000 0.251900 3662.500000 4508.000000 9.354700 26.897000 42.344000 2706.300000 74.963000 120.400000 0.336090 80.077000 50.016000 2634.600000 25.126000 49.904000 3103.200000 22.936000 65.847000 233.750000 341.180000 94.595000 77.259000 32.175000 8.887000 26.434000 6.881100 18.777000 1.653700 32.921000 13.815000 24.039000 1.260200 18.568000 2.260800 4.840700 2.296300 0.018454 0.836905 0.098408 53.745000 43.795500 63.078000 53.973000 25.293000 61.482000 22.120000 39.950000 38.146000 46.312000 48.272000 41.200000 18.292000
75% 15.000000 8.000000 380.00000 0.283750 3690.000000 4547.300000 9.456000 27.040000 42.521000 2715.700000 75.456000 120.420000 0.348560 80.365000 50.663000 2644.600000 25.814000 50.652000 3111.400000 23.375000 66.522000 249.440000 343.080000 94.759000 77.536000 32.445000 8.977000 26.758000 6.960100 19.062000 1.677800 33.255000 13.914000 24.479000 1.339600 18.894000 2.287400 4.908500 2.346100 0.025559 0.858940 0.106460 54.112000 44.187000 63.552000 54.451000 28.895000 62.803000 22.758000 41.500000 40.050000 48.043000 52.190000 41.954000 19.516250
max 20.000000 10.000000 500.00000 1.012100 3883.300000 5111.800000 11.777000 28.212000 44.209000 3000.200000 85.122000 120.920000 0.803380 87.051000 54.204000 2945.400000 31.263000 54.498000 3448.700000 26.654000 74.359000 463.460000 392.400000 98.623000 82.555000 39.481000 9.992500 36.255000 7.710200 25.294000 2.140300 43.956000 15.380000 38.829000 2.210200 27.568000 2.966100 6.300600 2.973800 0.057655 1.526600 0.173690 56.024000 46.442000 100.000000 100.000000 100.100000 100.010000 100.050000 95.729000 50.470000 56.943000 100.380000 100.160000 87.332000
In [38]:
train_transform_data, sc = data_preprocessing(train_all_10run.iloc[:, 3:])  # feature scaling for 52 features
In [39]:
pd.DataFrame(train_transform_data, columns=train_all_10run.columns[3:]).describe()
Out[39]:
xmeas_1 xmeas_2 xmeas_3 xmeas_4 xmeas_5 xmeas_6 xmeas_7 xmeas_8 xmeas_9 xmeas_10 xmeas_11 xmeas_12 xmeas_13 xmeas_14 xmeas_15 xmeas_16 xmeas_17 xmeas_18 xmeas_19 xmeas_20 xmeas_21 xmeas_22 xmeas_23 xmeas_24 xmeas_25 xmeas_26 xmeas_27 xmeas_28 xmeas_29 xmeas_30 xmeas_31 xmeas_32 xmeas_33 xmeas_34 xmeas_35 xmeas_36 xmeas_37 xmeas_38 xmeas_39 xmeas_40 xmeas_41 xmv_1 xmv_2 xmv_3 xmv_4 xmv_5 xmv_6 xmv_7 xmv_8 xmv_9 xmv_10 xmv_11
count 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05 1.010000e+05
mean -3.082842e-16 -3.566068e-15 -3.182664e-15 3.062495e-15 -3.545138e-15 -4.001305e-15 -1.936791e-15 -1.652361e-15 -2.983717e-13 2.925475e-16 -3.441011e-15 1.393264e-16 -4.652912e-15 -6.049526e-16 4.710937e-15 -1.923296e-15 -2.617600e-15 5.736529e-15 -3.556437e-16 2.632864e-15 5.891639e-16 3.127258e-15 -8.052876e-16 9.471741e-16 4.050610e-17 8.817352e-15 -2.750210e-15 4.950927e-16 2.821055e-16 7.562351e-15 1.326945e-15 -7.853786e-16 3.553527e-16 2.755913e-15 1.036249e-15 -4.552915e-16 2.597174e-16 -2.671587e-15 -1.088570e-15 5.512631e-15 1.157146e-15 -9.706496e-16 2.219745e-15 3.799540e-16 -8.531992e-16 4.054298e-16 4.482258e-17 1.246878e-15 1.159864e-15 5.392436e-16 7.056495e-16 5.630164e-16
std 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00 1.000005e+00
min -1.803274e+00 -6.687796e+00 -7.102098e+00 -6.447834e+00 -6.337242e+00 -6.267096e+00 -3.765830e+00 -8.518999e+00 -1.031768e+01 -3.376260e+00 -6.655724e+00 -4.368296e+00 -4.005302e+00 -4.933538e+00 -4.259839e+00 -2.960181e+00 -5.007649e+00 -5.694154e+00 -3.637053e+00 -7.968719e+00 -1.117849e+01 -9.140438e+00 -4.798676e+00 -6.808408e+00 -4.162237e+00 -5.945201e+00 -7.069416e+00 -5.432194e+00 -4.771237e+00 -5.561519e+00 -4.059252e+00 -6.059760e+00 -6.712795e+00 -5.285745e+00 -4.645419e+00 -4.664608e+00 -3.673353e+00 -5.068048e+00 -5.203316e+00 -4.439553e+00 -4.538764e+00 -2.072551e+00 -4.730018e+00 -1.508209e+00 -2.437923e+00 -2.767567e+00 -3.199702e+00 -4.368275e+00 -4.259688e+00 -3.002030e+00 -4.425668e+00 -4.571110e+00
25% -2.997790e-01 -6.475061e-01 -3.315630e-01 -2.993321e-01 -6.058248e-01 -5.886251e-01 -3.253498e-01 -3.636901e-01 -2.764201e-01 -2.696932e-01 -8.636623e-03 -6.758961e-01 -3.288351e-01 -6.766824e-01 -6.601343e-01 -3.073314e-01 -6.671113e-01 -3.333171e-01 -3.307988e-01 -1.035361e-01 -1.999227e-02 -3.229675e-02 -2.906809e-02 -4.204004e-01 -3.275743e-01 -6.049981e-01 -2.783630e-01 -1.101653e-02 -4.431365e-03 -3.562904e-01 -3.022072e-01 -5.409054e-01 -2.268789e-01 4.561336e-02 -6.057840e-02 -1.386275e-01 -7.265572e-01 -2.910160e-01 -5.745021e-01 -6.675192e-01 -6.200567e-01 -2.974666e-01 -1.514138e-01 -4.048348e-01 -3.959733e-01 -8.768500e-02 -1.445321e-01 -6.763766e-01 -6.602727e-01 -3.254310e-01 -1.625358e-01 -3.679678e-01
50% -7.369603e-02 -1.621694e-02 4.442307e-02 -1.599508e-02 -8.892273e-03 -4.501132e-02 -2.004937e-01 4.229335e-02 -1.316954e-03 -1.113171e-01 1.656619e-01 2.505419e-02 -1.932844e-01 1.468199e-02 -4.359436e-02 -2.129612e-01 -6.763888e-03 -7.143560e-02 -1.609065e-01 1.057142e-01 1.095923e-01 1.688034e-01 1.397625e-01 1.302932e-02 -1.710992e-01 3.521287e-02 4.005809e-02 1.901606e-01 1.359166e-01 3.275087e-02 -1.759116e-01 2.405930e-02 4.343671e-02 2.116833e-01 1.547769e-01 1.568795e-01 2.576932e-02 -6.131145e-02 4.622118e-02 -8.854702e-03 3.361075e-02 -1.022073e-01 -2.254536e-02 -2.581267e-01 -2.270914e-01 8.134864e-03 -1.024774e-02 2.468936e-02 -4.352828e-02 -1.522373e-01 -9.773699e-02 -8.602751e-02
75% 1.415728e-01 6.413760e-01 4.092689e-01 2.858139e-01 6.096682e-01 5.346250e-01 -7.008838e-02 4.242587e-01 2.737862e-01 3.368659e-02 3.324327e-01 6.674265e-01 -5.496738e-02 6.592059e-01 6.792456e-01 -1.054841e-01 6.769448e-01 2.885848e-01 7.008221e-02 2.839992e-01 2.391770e-01 3.587600e-01 2.964094e-01 4.384239e-01 -5.419701e-03 6.409214e-01 3.700581e-01 3.780818e-01 2.650523e-01 3.893721e-01 -3.005843e-02 5.962302e-01 3.137523e-01 3.634860e-01 3.561713e-01 4.304158e-01 7.141567e-01 1.903187e-01 6.430762e-01 6.323386e-01 6.790352e-01 9.812373e-02 1.018973e-01 -8.049894e-02 -3.559521e-02 8.804738e-02 1.134983e-01 6.670520e-01 6.792778e-01 7.778744e-02 -1.893332e-02 2.141163e-01
max 5.064369e+00 5.263656e+00 5.649865e+00 7.200905e+00 5.679269e+00 6.062456e+00 3.876753e+00 7.913259e+00 7.151365e+00 5.322425e+00 4.204061e+00 4.183099e+00 4.105609e+00 5.763873e+00 4.395880e+00 4.315496e+00 5.783735e+00 4.468555e+00 3.220892e+00 4.911902e+00 3.292317e+00 3.800608e+00 4.378512e+00 5.238292e+00 4.850933e+00 6.392086e+00 7.586058e+00 3.984454e+00 4.402419e+00 5.670247e+00 4.726744e+00 6.869933e+00 7.506137e+00 4.236739e+00 4.491291e+00 3.878181e+00 3.823513e+00 7.814705e+00 5.626504e+00 3.972833e+00 4.396614e+00 1.550248e+01 1.196014e+01 3.430879e+00 5.358044e+00 9.769236e+00 4.442933e+00 4.182503e+00 4.395612e+00 2.907009e+00 6.064417e+00 1.684019e+01
In [40]:
X_train, X_valid, y_train, y_valid = train_test_split(train_transform_data,  # split the dataset by the given ratio
                              train_all_10run['faultNumber'].values,
                              test_size=0.25,  # split ratio
                              random_state=17,  # random seed
                              stratify=train_all_10run['faultNumber'].values)  # preserve the class ratio of each faultNumber
In [41]:
model = LogisticRegression()  # create the model
model.fit(X_train, y_train)  # train the model
/usr/local/lib/python3.7/dist-packages/sklearn/linear_model/_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
Out[41]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
  • Results on the training set
In [42]:
# model.score(X_train, y_train)

pred_train = model.predict(X_train)  # make predictions
print(classification_report(y_train, pred_train))  # evaluate the results
              precision    recall  f1-score   support

         0.0       0.15      0.25      0.19      3750
         1.0       0.95      0.98      0.97      3600
         2.0       0.96      0.97      0.97      3600
         3.0       0.15      0.06      0.08      3600
         4.0       0.67      1.00      0.80      3600
         5.0       0.91      0.99      0.95      3600
         6.0       1.00      1.00      1.00      3600
         7.0       1.00      1.00      1.00      3600
         8.0       0.35      0.51      0.42      3600
         9.0       0.14      0.05      0.08      3600
        10.0       0.31      0.33      0.32      3600
        11.0       0.11      0.11      0.11      3600
        12.0       0.23      0.32      0.27      3600
        13.0       0.42      0.52      0.46      3600
        14.0       0.43      0.25      0.31      3600
        15.0       0.14      0.08      0.10      3600
        16.0       0.17      0.08      0.11      3600
        17.0       0.68      0.80      0.74      3600
        18.0       0.91      0.88      0.89      3600
        19.0       0.17      0.14      0.15      3600
        20.0       0.67      0.76      0.72      3600

    accuracy                           0.53     75750
   macro avg       0.50      0.53      0.51     75750
weighted avg       0.50      0.53      0.51     75750

  • Results on the validation set
In [43]:
pred_valid = model.predict(X_valid)  # make predictions
print(classification_report(y_valid, pred_valid))  # evaluate the results
              precision    recall  f1-score   support

         0.0       0.12      0.21      0.15      1250
         1.0       0.96      0.97      0.97      1200
         2.0       0.96      0.98      0.97      1200
         3.0       0.10      0.04      0.06      1200
         4.0       0.65      1.00      0.78      1200
         5.0       0.89      0.98      0.93      1200
         6.0       1.00      1.00      1.00      1200
         7.0       1.00      1.00      1.00      1200
         8.0       0.37      0.52      0.43      1200
         9.0       0.11      0.04      0.06      1200
        10.0       0.31      0.33      0.32      1200
        11.0       0.11      0.11      0.11      1200
        12.0       0.21      0.31      0.25      1200
        13.0       0.40      0.51      0.45      1200
        14.0       0.40      0.23      0.29      1200
        15.0       0.12      0.06      0.08      1200
        16.0       0.15      0.07      0.10      1200
        17.0       0.68      0.80      0.74      1200
        18.0       0.90      0.87      0.89      1200
        19.0       0.16      0.13      0.15      1200
        20.0       0.67      0.74      0.70      1200

    accuracy                           0.52     25250
   macro avg       0.49      0.52      0.50     25250
weighted avg       0.49      0.52      0.50     25250

In [44]:
print(confusion_matrix(y_valid, pred_valid))  # print the confusion matrix
[[ 257    0    0   67    0    0    0    0  111   65  135  166   98   29   62   98   46    0    0   98   18]
 [   1 1165    0    0    0    0    0    0    2    0    0    0   10    7    1    0    0    0    0    0   14]
 [   9    0 1176    1    0    0    0    0    0    1    1    1    0    0    3    4    1    0    0    3    0]
 [ 280    0    0   49    0    0    0    0  104   53  103  185   98   19   59   83   46    0    1   90   30]
 [   0    0    0    0 1199    0    0    0    0    0    0    1    0    0    0    0    0    0    0    0    0]
 [   5    0    0    1    0 1172    0    0    5    0    0    0   10    0    2    2    0    0    0    2    1]
 [   0    0    0    0    0    0 1200    0    0    0    0    0    0    0    0    0    0    0    0    0    0]
 [   0    0    0    0    0    0    0 1200    0    0    0    0    0    0    0    0    0    0    0    0    0]
 [  51   35   47   29    0    0    0    0  626   11   20   24   80  157   14   22   32    6    0   16   30]
 [ 276    0    0   50    0    0    0    0   92   53  121  158   95   25   67   92   41    0    1  108   21]
 [ 161    0    0   40    0    0    0    0   59   27  401   76  174   46   26   34   40    1   14   79   22]
 [ 160    0    0    4  343    0    0    0   89   69   93  127   51   16   35   30   62   41    0   59   21]
 [  34    0    0   12    1  131    0    0  155   11   73   24  368  170    9   23   31    4   94   15   45]
 [  55    9    2   20    1    0    0    0  119   15   35   29  200  611   12   19   30   12    0   20   11]
 [  47    4    0   39  274    0    0    0   48    5   24   37   20    5  272   14    6  377    1   20    7]
 [ 296    0    0   56    0    0    0    0  106   66  101  142  109   23   55   73   59    0    1   84   29]
 [ 170    0    0   22    0    0    0    0   73   32   52   89   49  379   16   57   87    0    0  163   11]
 [  34    0    0   70   37    0    0    0    8    7   16   19   22    2    6    5    4  956    0    8    6]
 [  28    0    0    5    0   12    0    0   14    7   21   25   18    2    2    3    3    0 1045   12    3]
 [ 152    0    0   20    0    0    0    0   68   38   54   58  282   31   31   42   93    0    1  159  171]
 [  69    0    0   14    0    0    0    0   24   11   43   37   43    9   12   21    3    0    0   28  886]]

Model training: a tree-based model

In [45]:
from sklearn.tree import DecisionTreeClassifier
In [46]:
X_train, X_valid, y_train, y_valid = train_test_split(train_transform_data,  # split the dataset by the given ratio
                              train_all_10run['faultNumber'].values,
                              test_size=0.25,  # split ratio
                              random_state=17,  # fixed random seed for reproducibility
                              stratify=train_all_10run['faultNumber'].values)  # keep the class proportions of faultNumber
In [47]:
model = DecisionTreeClassifier()  # build the model
model.fit(X_train, y_train)  # train the model
Out[47]:
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')
  • Results on the training set
In [48]:
# model.score(X_train, y_train)

pred_train = model.predict(X_train)  # predictions
print(classification_report(y_train, pred_train))  # evaluation
              precision    recall  f1-score   support

         0.0       0.83      1.00      0.91      3750
         1.0       1.00      1.00      1.00      3600
         2.0       1.00      1.00      1.00      3600
         3.0       1.00      1.00      1.00      3600
         4.0       1.00      1.00      1.00      3600
         5.0       1.00      1.00      1.00      3600
         6.0       1.00      1.00      1.00      3600
         7.0       1.00      1.00      1.00      3600
         8.0       0.99      0.99      0.99      3600
         9.0       0.99      0.99      0.99      3600
        10.0       0.99      0.98      0.98      3600
        11.0       1.00      1.00      1.00      3600
        12.0       1.00      1.00      1.00      3600
        13.0       0.99      0.97      0.98      3600
        14.0       1.00      1.00      1.00      3600
        15.0       1.00      1.00      1.00      3600
        16.0       1.00      0.99      0.99      3600
        17.0       0.99      0.94      0.97      3600
        18.0       1.00      0.94      0.97      3600
        19.0       1.00      1.00      1.00      3600
        20.0       1.00      0.94      0.97      3600

    accuracy                           0.99     75750
   macro avg       0.99      0.99      0.99     75750
weighted avg       0.99      0.99      0.99     75750

  • Results on the validation set
In [49]:
pred_valid = model.predict(X_valid)  # predictions
print(classification_report(y_valid, pred_valid))  # evaluation
              precision    recall  f1-score   support

         0.0       0.06      0.09      0.07      1250
         1.0       0.98      0.98      0.98      1200
         2.0       0.99      0.98      0.99      1200
         3.0       0.09      0.11      0.10      1200
         4.0       0.93      0.94      0.93      1200
         5.0       0.91      0.88      0.89      1200
         6.0       1.00      1.00      1.00      1200
         7.0       1.00      0.99      0.99      1200
         8.0       0.87      0.83      0.85      1200
         9.0       0.14      0.13      0.14      1200
        10.0       0.68      0.64      0.66      1200
        11.0       0.74      0.74      0.74      1200
        12.0       0.80      0.77      0.78      1200
        13.0       0.82      0.81      0.82      1200
        14.0       0.96      0.96      0.96      1200
        15.0       0.17      0.15      0.16      1200
        16.0       0.71      0.67      0.69      1200
        17.0       0.88      0.85      0.86      1200
        18.0       0.84      0.81      0.83      1200
        19.0       0.73      0.63      0.68      1200
        20.0       0.68      0.59      0.64      1200

    accuracy                           0.69     25250
   macro avg       0.71      0.69      0.70     25250
weighted avg       0.71      0.69      0.70     25250

In [50]:
print(confusion_matrix(y_valid, pred_valid))  # print the confusion matrix
[[ 110    0    0  572    0    0    0    0    4  198   35   22    0   18    0  156   19   25   31   21   39]
 [   1 1173    0    1    0    0    0    0    9    1    0    0    2    1    0    5    0    2    1    1    3]
 [   2    0 1182    4    0    0    0    0    2    0    2    2    0    1    0    3    1    0    0    0    1]
 [ 633    0    3  128    1    3    0    0    4  181   21   19    0   14    0  132   12   13    9   17   10]
 [   0    0    0    0 1126    0    0    0    2    0    0   72    0    0    0    0    0    0    0    0    0]
 [   3    2    0    9    0 1053    0    0   17   13   19    2   36   10    0   12    9    2    1    3    9]
 [   0    0    0    0    0    0 1198    0    1    0    0    0    0    0    0    0    0    0    0    0    1]
 [   0    0    0    0    0    0    0 1189    2    0    0    0    5    1    0    0    0    0    3    0    0]
 [  26   11    4    3    0   17    0    0  991   11   10    3   46   33    0    8   10    2    8    5   12]
 [ 332    0    1  250    0    8    0    0    4  155   36   38    2   17    1  230   50    4    9   29   34]
 [  92    0    0   37    0   10    0    0    5   53  772    5    6   17    0   56   80    5    8   23   31]
 [  47    0    0   39   85    1    0    0    1   25    4  894    1    3   11   25    6   17    4   31    6]
 [  11    0    0    5    0   31    0    3   44    6   20   11  921   37    0    9    9    3   71    6   13]
 [  53    3    0   10    0    7    0    0   36   13   19    3   35  970    0   14    7    7    5    2   16]
 [   1    0    0    1    0    1    0    0    0    0    0   15    0    1 1149    0    0   31    0    1    0]
 [ 310    1    2  230    0    5    0    0    4  223   34   28    4   17    1  185   46    4   17   35   54]
 [  60    0    1   37    0    4    0    0    4   58   80    4   13   16    0   62  801    1    3   32   24]
 [  54    0    0   11    0    2    0    0    1    5    6   24    4    4   33    8    1 1020    4   14    9]
 [  76    0    0    7    0    0    0    2    8   17   11    3   64    8    0    9    0    7  977    3    8]
 [  45    1    2   61    0    5    0    0    3   55   22   53    2    3    0   77   45    7    3  754   62]
 [ 105    1    2   49    0    8    0    0    3   72   46   12    8    6    0   67   37    9    6   56  713]]

Overfitting

An overfitted model achieves very high accuracy on the training set but performs poorly on the validation set, which took no part in training.

Ways to mitigate overfitting:

  • Collect more data
  • Reduce noisy data
  • Reduce model complexity
  • Add randomness
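The "reduce model complexity" idea applies directly to the decision tree above. A minimal, self-contained sketch on synthetic stand-in data (hypothetical, not the TEP dataset) comparing an unconstrained tree with one whose depth and leaf size are limited:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in here for the TEP features (hypothetical example)
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           n_classes=4, random_state=17)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=17)

# Unconstrained tree: grows until it memorizes the training set
full = DecisionTreeClassifier(random_state=17).fit(X_tr, y_tr)

# Constrained tree: limited depth and minimum leaf size lower model complexity
pruned = DecisionTreeClassifier(max_depth=8, min_samples_leaf=20,
                                random_state=17).fit(X_tr, y_tr)

print('full tree   train/valid:', full.score(X_tr, y_tr), full.score(X_va, y_va))
print('pruned tree train/valid:', pruned.score(X_tr, y_tr), pruned.score(X_va, y_va))
```

Limiting `max_depth` and `min_samples_leaf` gives up some training accuracy in exchange for a smaller gap between training and validation scores.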

Model training: an ensemble model

In [51]:
from lightgbm import LGBMClassifier
In [52]:
X_train, X_valid, y_train, y_valid = train_test_split(train_transform_data,  # split the dataset by the given ratio
                              train_all_10run['faultNumber'].values,
                              test_size=0.25,  # split ratio
                              random_state=17,  # fixed random seed for reproducibility
                              stratify=train_all_10run['faultNumber'].values)  # keep the class proportions of faultNumber
In [53]:
model = LGBMClassifier()  # build the model
model.fit(X_train, y_train)  # train the model
Out[53]:
LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.1, max_depth=-1,
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
               random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True,
               subsample=1.0, subsample_for_bin=200000, subsample_freq=0)
  • Results on the training set
In [54]:
# model.score(X_train, y_train)

pred_train = model.predict(X_train)  # predictions
print(classification_report(y_train, pred_train))  # evaluation
              precision    recall  f1-score   support

         0.0       0.44      0.61      0.51      3750
         1.0       1.00      1.00      1.00      3600
         2.0       1.00      1.00      1.00      3600
         3.0       0.46      0.64      0.53      3600
         4.0       0.99      1.00      0.99      3600
         5.0       1.00      1.00      1.00      3600
         6.0       1.00      1.00      1.00      3600
         7.0       1.00      1.00      1.00      3600
         8.0       1.00      0.98      0.99      3600
         9.0       0.66      0.55      0.60      3600
        10.0       0.96      0.89      0.92      3600
        11.0       0.97      0.94      0.96      3600
        12.0       0.99      0.98      0.99      3600
        13.0       1.00      0.94      0.97      3600
        14.0       1.00      1.00      1.00      3600
        15.0       0.71      0.58      0.64      3600
        16.0       0.99      0.88      0.93      3600
        17.0       1.00      0.94      0.96      3600
        18.0       0.99      0.92      0.96      3600
        19.0       0.90      0.94      0.92      3600
        20.0       0.97      0.89      0.93      3600

    accuracy                           0.89     75750
   macro avg       0.91      0.89      0.90     75750
weighted avg       0.90      0.89      0.89     75750

  • Results on the validation set
In [55]:
pred_valid = model.predict(X_valid)  # predictions
print(classification_report(y_valid, pred_valid))  # evaluation
              precision    recall  f1-score   support

         0.0       0.15      0.23      0.18      1250
         1.0       1.00      0.99      0.99      1200
         2.0       0.99      0.99      0.99      1200
         3.0       0.18      0.28      0.21      1200
         4.0       0.93      0.98      0.96      1200
         5.0       0.99      0.99      0.99      1200
         6.0       1.00      1.00      1.00      1200
         7.0       1.00      1.00      1.00      1200
         8.0       0.99      0.95      0.97      1200
         9.0       0.24      0.19      0.21      1200
        10.0       0.88      0.80      0.84      1200
        11.0       0.89      0.82      0.86      1200
        12.0       0.94      0.93      0.94      1200
        13.0       0.99      0.88      0.93      1200
        14.0       0.99      0.97      0.98      1200
        15.0       0.26      0.21      0.23      1200
        16.0       0.98      0.79      0.87      1200
        17.0       0.95      0.91      0.93      1200
        18.0       0.97      0.89      0.93      1200
        19.0       0.84      0.87      0.85      1200
        20.0       0.89      0.75      0.81      1200

    accuracy                           0.78     25250
   macro avg       0.81      0.78      0.79     25250
weighted avg       0.81      0.78      0.79     25250

In [56]:
print(confusion_matrix(y_valid, pred_valid))  # print the confusion matrix
[[ 288    0    0  499    0    0    0    0    0  204    6    8    0    0    0  174    0    7    8   41   15]
 [   2 1187    3    0    0    0    0    0    2    0    0    0    3    0    0    1    0    0    0    0    2]
 [   5    0 1188    3    0    0    0    0    0    1    0    0    0    0    0    2    0    0    0    1    0]
 [ 525    0    1  331    0    0    0    0    1  151    6    9    0    0    0  136    1    4    1   24   10]
 [   0    0    0    0 1182    0    0    0    0    0    0   18    0    0    0    0    0    0    0    0    0]
 [   0    0    0    0    0 1184    0    0    0    0    1    0   13    1    0    0    0    0    0    1    0]
 [   0    0    0    0    0    0 1200    0    0    0    0    0    0    0    0    0    0    0    0    0    0]
 [   0    0    0    0    0    0    0 1199    0    0    0    0    1    0    0    0    0    0    0    0    0]
 [  24    1    1   14    0    0    0    0 1135    6    1    0    7    3    0    2    0    0    0    1    5]
 [ 344    0    0  366    0    0    0    0    1  227    9   13    0    0    0  187    1    0    1   36   15]
 [  76    0    0   58    0    0    0    0    0   37  956    7    2    0    0   34    3    2    1   13   11]
 [  32    0    1   29   88    0    0    0    0   21    0  989    0    0    2   26    0    4    0    7    1]
 [   8    0    0    1    0   14    0    0    5    1   12    3 1117    3    0    9    1    0   20    3    3]
 [  46    0    0   38    0    0    0    0    2   15    8    1   20 1051    0   13    0    1    2    2    1]
 [   0    0    0    1    0    0    0    0    0    0    0    2    1    0 1170    0    0   26    0    0    0]
 [ 356    0    0  352    0    0    0    0    0  164    5   14    2    0    0  247    3    0    1   36   20]
 [  59    0    0   51    0    0    0    0    0   33   55    4    3    0    0   30  945    2    0    5   13]
 [  42    0    0   24    0    0    0    0    0    6    2    8    0    0    8   11    0 1094    0    4    1]
 [  69    0    0   21    0    0    0    0    0   12    0    3   14    1    0   11    0    2 1064    2    1]
 [  21    0    0   42    0    0    0    0    0   28    6   22    0    0    0   15   10    2    0 1042   12]
 [  87    0    0   58    0    1    0    0    0   41   14   11    5    0    0   57    4    3    1   22  896]]

Feature engineering - Principal Component Analysis (PCA)

In [57]:
from sklearn.decomposition import PCA
In [58]:
pca = PCA(n_components=0.99)  # n_components: int, float or 'mle'
pca.fit(train_transform_data)  # compute the covariance matrix and eigendecompose it
pca_transformed_data = pca.transform(train_transform_data)  # project onto the principal components (dimensionality reduction)
In [59]:
pca_transformed_data.shape
Out[59]:
(101000, 34)
In [60]:
pca.explained_variance_ratio_
Out[60]:
array([0.244, 0.181, 0.0777, 0.0587, 0.0398, 0.036, 0.0291, 0.028, 0.0273, 0.023, 0.0194, 0.019, 0.0183, 0.0175, 0.0165, 0.0152, 0.0146, 0.0137,
       0.0136, 0.0123, 0.0118, 0.0102, 0.00992, 0.00916, 0.00841, 0.00692, 0.00608, 0.00582, 0.00522, 0.00354, 0.00235, 0.00224, 0.00203, 0.00185])
In [61]:
pca.explained_variance_ratio_.sum()
Out[61]:
0.990136685271072
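Passing a float such as 0.99 to `n_components` keeps the smallest number of leading components whose cumulative explained-variance ratio reaches that value. The idea can be sketched with hypothetical ratios:

```python
import numpy as np

# Hypothetical explained-variance ratios, sorted in decreasing order
evr = np.array([0.5, 0.3, 0.15, 0.04, 0.01])
cumulative = np.cumsum(evr)

# Smallest number of components whose cumulative ratio reaches 99%
k = int(np.argmax(cumulative >= 0.99)) + 1
print(k)  # → 4
```

Here 4 components already carry 99% of the variance, so the fifth is dropped, just as PCA above kept 34 of the original features' directions.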
In [62]:
X_train, X_valid, y_train, y_valid = train_test_split(pca_transformed_data,  # split the dataset by the given ratio
                              train_all_10run['faultNumber'].values,
                              test_size=0.25,  # split ratio
                              random_state=17,  # fixed random seed for reproducibility
                              stratify=train_all_10run['faultNumber'].values)  # keep the class proportions of faultNumber
In [63]:
model = LGBMClassifier()  # build the model
model.fit(X_train, y_train)  # train the model
Out[63]:
LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.1, max_depth=-1,
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
               random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True,
               subsample=1.0, subsample_for_bin=200000, subsample_freq=0)
  • Results on the training set
In [64]:
# model.score(X_train, y_train)

pred_train = model.predict(X_train)  # predictions
print(classification_report(y_train, pred_train))  # evaluation
              precision    recall  f1-score   support

         0.0       0.31      0.50      0.38      3750
         1.0       1.00      1.00      1.00      3600
         2.0       1.00      1.00      1.00      3600
         3.0       0.40      0.36      0.38      3600
         4.0       0.69      0.96      0.80      3600
         5.0       0.98      0.99      0.99      3600
         6.0       1.00      1.00      1.00      3600
         7.0       1.00      1.00      1.00      3600
         8.0       0.99      0.95      0.97      3600
         9.0       0.46      0.41      0.44      3600
        10.0       0.67      0.61      0.64      3600
        11.0       0.75      0.62      0.68      3600
        12.0       0.97      0.93      0.95      3600
        13.0       1.00      0.90      0.94      3600
        14.0       0.98      0.97      0.98      3600
        15.0       0.48      0.44      0.46      3600
        16.0       0.69      0.50      0.58      3600
        17.0       0.97      0.88      0.92      3600
        18.0       0.99      0.90      0.94      3600
        19.0       0.65      0.72      0.68      3600
        20.0       0.75      0.76      0.75      3600

    accuracy                           0.78     75750
   macro avg       0.80      0.78      0.78     75750
weighted avg       0.79      0.78      0.78     75750

  • Results on the validation set
In [65]:
pred_valid = model.predict(X_valid)  # predictions
print(classification_report(y_valid, pred_valid))  # evaluation
              precision    recall  f1-score   support

         0.0       0.06      0.11      0.08      1250
         1.0       0.98      0.97      0.97      1200
         2.0       0.99      0.98      0.99      1200
         3.0       0.03      0.03      0.03      1200
         4.0       0.53      0.81      0.64      1200
         5.0       0.95      0.96      0.95      1200
         6.0       1.00      0.99      1.00      1200
         7.0       1.00      1.00      1.00      1200
         8.0       0.94      0.85      0.89      1200
         9.0       0.04      0.04      0.04      1200
        10.0       0.29      0.27      0.28      1200
        11.0       0.56      0.45      0.50      1200
        12.0       0.85      0.81      0.83      1200
        13.0       0.97      0.81      0.88      1200
        14.0       0.94      0.93      0.94      1200
        15.0       0.04      0.04      0.04      1200
        16.0       0.21      0.15      0.17      1200
        17.0       0.91      0.79      0.85      1200
        18.0       0.94      0.81      0.87      1200
        19.0       0.51      0.58      0.54      1200
        20.0       0.55      0.52      0.53      1200

    accuracy                           0.61     25250
   macro avg       0.63      0.61      0.62     25250
weighted avg       0.63      0.61      0.62     25250

In [66]:
print(confusion_matrix(y_valid, pred_valid))  # print the confusion matrix
[[ 132    0    0  349   73    0    0    0    1  241   70   31    0    0    0  170   55    0    2   80   46]
 [   6 1161    1    1    1    0    0    0   14    3    1    0    0    1    0    4    1    0    0    1    5]
 [   8    0 1172    2    3    0    0    0    1    2    0    0    0    0    0    5    2    0    0    0    5]
 [ 466    0    0   37   86    2    0    0    1  180   58   34    0    0    0  175   49    1    0   75   36]
 [  26    0    0   36  972    4    0    0    0   33   20   20    0    0    1   27   21    1    0   39    0]
 [   4    0    0    0    8 1148    0    0    5    2    8    2   18    2    0    0    2    0    0    1    0]
 [   0    0    0    1    1    0 1194    0    0    0    0    1    0    0    0    0    0    0    0    1    2]
 [   0    0    0    0    0    0    0 1197    0    0    0    0    3    0    0    0    0    0    0    0    0]
 [  22   19    3    9    4    1    0    2 1025    7   19    4   36   14    0    9    6    0    1    5   14]
 [ 390    0    0  222   82    0    0    0    0   44   76   33    1    1    0  180   59    0    0   79   33]
 [ 148    0    0   71   65    2    0    0    4   77  322   32    8    0    0  109  222    4    1   67   68]
 [  85    0    1   36  212    2    0    0    0   40   23  534    0    0   37   44   24   11    0   41  110]
 [   8    1    0    6    4   32    0    0   21    3   28   10  977   12    0    8   15    0   56    7   12]
 [  47    1    0   16    6    1    0    0   12   15   18    5   36  978    2   22   14    0    2   10   15]
 [   5    0    0    0    1    0    0    0    0    0    3   14    0    0 1115    0    0   62    0    0    0]
 [ 351    0    2  188   81    0    0    0    2  193   76   40    2    0    0   47   85    1    1   85   46]
 [ 206    0    0   89   82    2    0    0    7  120  257   31    8    0    0  101  175    2    0   72   48]
 [  39    0    0   30   23    1    0    0    1   14    8   47    1    0   30   25    6  946    0   16   13]
 [  49    0    0   19   13    4    0    1    0   14   16    5   54    2    0   15    4    0  978   12   14]
 [  85    0    0   57   72    5    0    0    1   41   40   75    2    0    0   43   32    4    0  693   50]
 [ 130    0    0   56   32    3    0    0    1   61   53   36    4    1    0   62   62    3    1   68  627]]

11. Specialized Approaches for Anomaly Detection

(Back to the outline...)

In anomaly detection, the collected data typically consist of a large amount of normal data and only a handful of anomalous samples. A plain classification approach therefore often fails to detect the anomalies because of the class imbalance.

For anomaly detection problems, models are instead trained around the following ideas:

  • Train on normal data only (learn the characteristics of normal operation)
  • Apply the feature transformation learned from normal data to all data; anomalous samples are expected either to transform poorly or to differ markedly from the features of normal data

To better reflect the real situation, where normal and anomalous data are severely imbalanced, the demonstration below uses 100 simulation runs of normal training data and 10 runs of fault data

PCA-based fault detection

  • The T$^2$ statistic
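Writing $t_i$ for a sample's score on the $i$-th retained principal component and $\lambda_i$ for the corresponding explained variance (eigenvalue), the T$^2$ statistic is defined as

$$T^2 = \sum_{i=1}^{k} \frac{t_i^2}{\lambda_i}$$

i.e., the squared Mahalanobis distance of the sample within the $k$-dimensional principal-component subspace.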
In [67]:
train_normal_df.describe()
Out[67]:
(Descriptive statistics of train_normal_df: 50,000 rows; faultNumber is 0 throughout, simulationRun spans 1-100 and sample 1-500; each of the 52 process variables xmeas_1-xmeas_41 and xmv_1-xmv_11 is summarized by count, mean, std, min, 25%, 50%, 75%, max.)
In [68]:
sc = StandardScaler()
sc_normal = sc.fit_transform(train_normal_df.iloc[:, 3:])  # fit the scaler on normal data and standardize it
In [69]:
pca_normal = PCA(n_components=0.99)  # n_components: int, float or 'mle'
transformed_normal_data = pca_normal.fit_transform(sc_normal)  # eigendecompose the covariance matrix, then project (dimensionality reduction)
In [70]:
pca_normal.explained_variance_ratio_
Out[70]:
array([0.159, 0.0926, 0.0503, 0.0395, 0.0385, 0.0377, 0.034, 0.0305, 0.0267, 0.023, 0.0222, 0.0204, 0.0199, 0.0195, 0.0194, 0.0189, 0.0187, 0.0187,
       0.0184, 0.0183, 0.0182, 0.0177, 0.0177, 0.0174, 0.0172, 0.0166, 0.0163, 0.0158, 0.0137, 0.0133, 0.013, 0.0126, 0.0117, 0.0113, 0.0109,
       0.00979, 0.00886, 0.00818, 0.00713, 0.00642, 0.00465])
In [71]:
transformed_normal_data.shape
Out[71]:
(50000, 41)
In [72]:
T2_normal_data = (transformed_normal_data**2 / pca_normal.explained_variance_).sum(axis=1)  # compute the T^2 statistic
In [73]:
sc_fault = sc.transform(train_fault_df.iloc[:, 3:])  # standardize fault data with the scaler fitted on normal data
transformed_fault_data = pca_normal.transform(sc_fault)  # project fault data with the PCA fitted on normal data
In [74]:
transformed_fault_data.shape
Out[74]:
(100000, 41)
In [75]:
T2_fault_data = (transformed_fault_data**2 / pca_normal.explained_variance_).sum(axis=1)  # compute the T^2 statistic
In [76]:
plt.figure(figsize=(25, 40))  # set the figure size
for i in range(20):  # inspect the T^2 distribution for fault i+1
  plt.subplot(10, 2, i+1)  # position i+1 on a 10-row, 2-column grid of subplots
  plt.yscale('log')  # use a logarithmic y axis

  plt.plot(range(len(T2_normal_data[:500])),  # line plot of the normal statistic
           T2_normal_data[:500],
           label='Normal')

  plt.plot(range(len(T2_fault_data[i*500:(i+1)*500])),  # line plot of the fault statistic
           T2_fault_data[i*500:(i+1)*500],
           label=f'Fault_{i+1}')

  plt.axvline(x=20, color='r', linestyle='--')  # vertical line at the fault-introduction sample
  plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))  # add a legend
  plt.tight_layout()  # automatically adjust subplot spacing
  • The Q statistic, also known as the SPE (Squared Prediction Error) statistic
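Writing $\hat{\mathbf{x}} = \mathbf{t}P^{\top}$ for the reconstruction of a standardized sample $\mathbf{x}$ from its score vector $\mathbf{t}$ and the loading matrix $P$ of the retained components, the Q statistic is defined as

$$Q = \lVert \mathbf{x} - \hat{\mathbf{x}} \rVert^{2} = \sum_{j=1}^{m}\left(x_j - \hat{x}_j\right)^{2}$$

i.e., the squared reconstruction error: the variation the retained components fail to capture, left in the residual subspace.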
In [77]:
normal_project = transformed_normal_data @ pca_normal.components_  # reconstruct normal data from the retained components
Q_normal_data = ((sc_normal - normal_project)**2).sum(axis=1)  # compute the Q statistic (squared reconstruction error in the standardized space)
In [78]:
fault_project = transformed_fault_data @ pca_normal.components_  # reconstruct fault data from the retained components
Q_fault_data = ((sc_fault - fault_project)**2).sum(axis=1)  # compute the Q statistic (squared reconstruction error in the standardized space)
In [79]:
plt.figure(figsize=(25, 40))  # set the figure size
for i in range(20):  # inspect the Q distribution for fault i+1
  plt.subplot(10, 2, i+1)  # position i+1 on a 10-row, 2-column grid of subplots
  plt.yscale('log')  # use a logarithmic y axis

  plt.plot(range(len(Q_normal_data[:500])),  # line plot of the normal statistic
           Q_normal_data[:500],
           label='Normal')

  plt.plot(range(len(Q_fault_data[i*500:(i+1)*500])),  # line plot of the fault statistic
           Q_fault_data[i*500:(i+1)*500],
           label=f'Fault_{i+1}')

  plt.axvline(x=20, color='r', linestyle='--')  # vertical line at the fault-introduction sample
  plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))  # add a legend
  plt.tight_layout()  # automatically adjust subplot spacing
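Turning the T$^2$ or Q statistic into a detector requires a control limit; one common choice is an empirical percentile of the statistic on normal data. A minimal sketch using hypothetical stand-in values (not the statistics computed above):

```python
import numpy as np

rng = np.random.default_rng(17)
# Hypothetical stand-ins for T2_normal_data / T2_fault_data computed above
T2_normal = rng.chisquare(df=41, size=50000)
T2_fault = rng.chisquare(df=41, size=10000) * 5  # faults inflate the statistic

# Empirical 99th-percentile control limit, estimated from normal data only
limit = np.quantile(T2_normal, 0.99)

false_alarm_rate = (T2_normal > limit).mean()  # ~1% by construction
detection_rate = (T2_fault > limit).mean()
print(limit, false_alarm_rate, detection_rate)
```

The percentile sets the expected false-alarm rate on normal operation; any sample whose statistic exceeds the limit is flagged as a fault.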

Ensembling

  • Train N models: each uses N-1 of the features in the normal training set to predict the remaining feature
  • Predict every feature of the fault training set
  • Compare all predicted feature values with the observed values (measuring the distance with a metric such as MSE)

In [80]:
sc = StandardScaler()
sc_normal_df = pd.DataFrame(sc.fit_transform(train_normal_df.iloc[:, 3:]),  # fit the scaler on normal data and standardize it
                columns=train_normal_df.iloc[:, 3:].columns)
In [81]:
import tqdm
from lightgbm import LGBMRegressor
In [82]:
def train(df, cols_to_predict):
  models = {}  # storage for the per-feature models
  pbar = tqdm.tqdm_notebook(cols_to_predict)
  for col in pbar:
    pbar.set_description(f'Training model for {col}')
    model = LGBMRegressor(learning_rate=0.1)  # build the model
    tr_x = df.drop([col], axis=1)  # use the other N-1 features as X
    target = df[col]  # use the remaining feature as y
    
    model.fit(X=tr_x, y=target)  # train the model
    models[col] = model  # store the trained model
    
  return models

def predict(models, df, cols_to_predict):
  preds = []
  for col in cols_to_predict:
      test_x = df.drop([col], axis=1)  # use the other N-1 features as X
      
      pred = models[col].predict(test_x)  # predict the held-out feature
      preds.append(pred)
  
  return preds
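The third step, scoring, can be sketched as follows; `anomaly_score` is a hypothetical helper that compares the observed features with the per-feature predictions in the format returned by `predict` above:

```python
import numpy as np
import pandas as pd

def anomaly_score(df, preds, cols_to_predict):
    """Mean squared error between each sample's observed and predicted features."""
    pred_df = pd.DataFrame(np.column_stack(preds), columns=cols_to_predict,
                           index=df.index)
    return ((df[cols_to_predict] - pred_df) ** 2).mean(axis=1)

# Hypothetical example: two features; the second sample deviates from its predictions
df = pd.DataFrame({'a': [0.0, 5.0], 'b': [0.0, 5.0]})
preds = [np.array([0.0, 0.0]), np.array([0.0, 1.0])]  # one array per predicted column
print(anomaly_score(df, preds, ['a', 'b']).tolist())  # → [0.0, 20.5]
```

Samples whose features the normal-data models cannot reproduce get a large score and can be flagged as anomalies, again via a threshold estimated on normal data.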
In [83]:
sc_normal_df.describe()
Out[83]:
(52-column summary of the scaled features xmeas_1–xmeas_41 and xmv_1–xmv_11: each column has count 5.0e4, mean ≈ 0, std ≈ 1, and min/max roughly within ±4.5, confirming the standardisation.)
In [84]:
features_to_predict = sc_normal_df.columns
models = train(sc_normal_df, features_to_predict)  # train the models via the custom train function
In [85]:
def get_mse(sample, preds):
    return ((sample.loc[:, features_to_predict] - np.transpose(preds)) ** 2).mean(axis=1)  # mean squared error per row
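As a quick sanity check, the formula can be worked through on toy numbers (not from the dataset); `preds` mimics the list-of-arrays layout returned by `predict()`:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({'f1': [1.0, 2.0], 'f2': [3.0, 4.0]})
preds = [np.array([1.0, 2.5]),   # predictions for f1
         np.array([2.0, 4.0])]   # predictions for f2

# same computation as get_mse: transpose to (rows, features), square, average
mse = ((sample - np.transpose(preds)) ** 2).mean(axis=1)
# row 0: ((1-1)^2 + (3-2)^2) / 2 = 0.5
# row 1: ((2-2.5)^2 + (4-4)^2) / 2 = 0.125
```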
In [86]:
plt.figure(figsize=(25,40))  # set the figure size
plt.title('MSE for normal and fault samples')  # set the title

normal_Run = np.random.randint(100)+1  # randomly pick one simulation run of normal data
print(f'Normal simulationRun = {normal_Run}')
normal_sample = pd.DataFrame(sc.transform(train_normal_df[train_normal_df.simulationRun==normal_Run].iloc[:, 3:]),
                columns=train_normal_df.columns[3:])
normal_preds = predict(models, normal_sample, features_to_predict)  # predict with the custom predict function

fault_Run = np.random.randint(10)+1  # randomly pick one simulation run of faulty data
print(f'fault simulationRun = {fault_Run}')
for i in range(20):
  plt.subplot(10, 2, i+1)  # select position i+1 on a 10-row, 2-column grid
  plt.yscale('log')  # use a log scale on the y axis
  plt.plot(get_mse(normal_sample, normal_preds),
        label='Normal')  # MSE between predictions and original features
  faulty_sample = pd.DataFrame(sc.transform(train_fault_df[(train_fault_df.simulationRun==fault_Run) & (train_fault_df.faultNumber==(i+1))].iloc[:,3:]),
                  columns=train_fault_df.columns[3:])
  faulty_preds = predict(models, faulty_sample, features_to_predict)  # predict with the custom predict function
  plt.plot(get_mse(faulty_sample, faulty_preds),
        label=f'Fault_{i+1}')  # MSE between predictions and original features
  plt.axvline(x=20, color='r', linestyle='--')  # draw a vertical marker line
  plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))  # add the legend
  plt.tight_layout()  # keep spacing between subplots
Normal simulationRun = 73
fault simulationRun = 2
  • Results for a different simulationRun
In [87]:
plt.figure(figsize=(25,40))  # set the figure size
plt.title('MSE for normal and fault samples')  # set the title

normal_Run = np.random.randint(100)+1  # randomly pick one simulation run of normal data
print(f'Normal simulationRun = {normal_Run}')
normal_sample = pd.DataFrame(sc.transform(train_normal_df[train_normal_df.simulationRun==normal_Run].iloc[:, 3:]),
                columns=train_normal_df.columns[3:])
normal_preds = predict(models, normal_sample, features_to_predict)  # predict with the custom predict function

fault_Run = np.random.randint(10)+1  # randomly pick one simulation run of faulty data
print(f'fault simulationRun = {fault_Run}')
for i in range(20):
  plt.subplot(10, 2, i+1)  # select position i+1 on a 10-row, 2-column grid
  plt.yscale('log')  # use a log scale on the y axis
  plt.plot(get_mse(normal_sample, normal_preds),
        label='Normal')  # MSE between predictions and original features
  faulty_sample = pd.DataFrame(sc.transform(train_fault_df[(train_fault_df.simulationRun==fault_Run) & (train_fault_df.faultNumber==(i+1))].iloc[:,3:]),
                  columns=train_fault_df.columns[3:])
  faulty_preds = predict(models, faulty_sample, features_to_predict)  # predict with the custom predict function
  plt.plot(get_mse(faulty_sample, faulty_preds),
        label=f'Fault_{i+1}')  # MSE between predictions and original features
  plt.axvline(x=20, color='r', linestyle='--')  # draw a vertical marker line
  plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))  # add the legend
  plt.tight_layout()  # keep spacing between subplots
Normal simulationRun = 32
fault simulationRun = 7
  • For most faults, the MSE is clearly separated from the normal runs. The threshold between them can be set empirically through repeated experiments, or by choosing a confidence level on the normal MSE distribution and using the corresponding statistic as the cut-off.
  • Fault_3, Fault_9, and Fault_15 are almost indistinguishable from normal; other sources likewise report that the labels for these three faults may be unreliable.
  • The dataset also has temporal structure: Fault_17 shows a clearly periodic pattern. In a machine-learning approach we can therefore add features computed over a trailing time window, e.g. the mean, maximum, and minimum over past time steps.

  • In deep learning, RNN models designed specifically for time series are a promising direction to try on the modelling side.
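One concrete way to set that threshold is to take a high percentile of the MSE distribution observed on normal runs, which directly controls the false-alarm rate. A minimal sketch follows; the lognormal numbers merely simulate normal-run MSE values:

```python
import numpy as np

rng = np.random.default_rng(42)
normal_mse = rng.lognormal(mean=-2.0, sigma=0.5, size=5000)  # stand-in for MSE on normal runs

threshold = np.percentile(normal_mse, 99)  # accept ~1% false alarms on normal data
flags = normal_mse > threshold             # samples flagged as anomalous
false_alarm_rate = flags.mean()
```

Raising the percentile trades false alarms for missed detections, so the choice should reflect the relative cost of each in the plant.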
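The window features mentioned above (lags, rolling mean/max) are straightforward to build with pandas. A minimal sketch on toy data; `xmeas_1` stands in for any measurement column:

```python
import numpy as np
import pandas as pd

ts = pd.DataFrame({'xmeas_1': np.arange(10, dtype=float)})
ts['xmeas_1_lag1'] = ts['xmeas_1'].shift(1)                 # value one step back
ts['xmeas_1_roll_mean'] = ts['xmeas_1'].rolling(3).mean()   # mean over the last 3 samples
ts['xmeas_1_roll_max'] = ts['xmeas_1'].rolling(3).max()     # max over the last 3 samples
ts = ts.dropna()  # the first rows have no full history
```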

Data

  • Data processing and model training form an iterative loop of validation
  • Data processing should align with domain knowledge as much as possible
  • Good, clean data solves most problems

Models

  • When designing a custom model, consider the format of the input data and how the output maps to the labels
  • Model design also depends on the available compute, and on whether accuracy or speed is the priority
  • Whether in machine learning or deep learning, parameter tuning alone yields limited gains in predictive performance

In short, model performance is most often determined by data quality, and data quality in turn depends on how the data is processed. Finding a processing approach that both fits the business goal and makes the data easy for a model to learn is the crux of a data-science project.